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Abstract. Gaussian time-series models are often specified through their spec- 
tral density. Such models present several computational challenges, in partic- 
ular because of the non-sparse nature of the covariance matrix. We derive a 
fast approximation of the likelihood for such models. We propose to sample 
from the approximate posterior (that is, the prior times the approximate like- 
lihood), and then to recover the exact posterior through importance sampling. 
We show that the variance of the importance sampling weights vanishes as the 
sample size goes to infinity. We explain why the approximate posterior may 
typically multi-modal, and we derive a Sequential Monte Carlo sampler based 
on an annealing sequence in order to sample from that target distribution. Per- 
formance of the overall approach is evaluated on simulated and real datasets. 
In addition, for one real world dataset, we provide some numerical evidence 
that a Bayesian approach to semi-parametric estimation of spectral density 
may provide more reasonable results than its Frequentist counter-parts. 



1. Introduction 



Several models in the time series literature are defined through their spectral 
density. Assuming Gaussianity, one observes a vector x of length n, from the 
Gaussian distribution 

x\ti,f ~N(nl,T{f)) 
where 1 = (1, . . . , 1)', and T(f) is the n x n Toeplitz matrix associated to spectral 
density /, with entries T(f)(l,m) — jf(l — m), 

(1.1) 7/ (Q = r /(A)e liA dX, l = -( n -l),...,(n- 1). 

J — 7T 

The models vary with respect to the specification of /. For instance, the FEXP 



parametrisation (Hurvich et al. 2002 Moulines and Soulier 2003) assumes that 



(1.2) 



/(A) = 



1 

27 



1 - e 



-i\ 



-2d 



ffPO, g(\) = exp I Si C0S <J A ) 



This specification conveniently separates the long-range behaviour, as determined 
by parameter d €E [0, 1/2), and the short-memory part g. By taking k large enough, 
g may be arbitrarily close to any function in a certain regularity class. |Rousseau| 
et al. (20121 show that, for a well chosen prior with respect to (k, d, £i, . . .), the 
corresponding posterior is consistent for both d and /, under semi-parametric set- 
tings; that is, assuming that the true spectral density belongs to a certain infinite- 
dimensional class of functions. 
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This FEXP parametrisation will be our running example. We note however that 
many other parametrisations of / are possible, which can also be tackled by the 
methodology developed in this paper. One may take d — 0, and obtain a non- 
parametric procedure for estimating the spectral density, under a short-memory 
assumption. One may replace g by another type of expansion, a spline regression, 
and so on. Finally, one may also consider a purely parametric model, such as an 
ARFLMA(p, q) model, given by: 



/(A) 



1 

2^ 



-2d 



-ijX 



ARMA models may also be defined through their spectral density, but well-known 
specialised methods exist for such models, hence they fall outside the scope of this 
paper. 

Whatever the specification of /, parametric or semiparametric, several compu- 
tational difficulties arise regarding Bayesian inference for such models. First, the 
likelihood of the data 



(1.3) p(z| M ,/) = (2^r i/2 |T(/)| 



-1/2 



exp 



1 



(x - M 1) T T(f)- 1 (x - /il) 



involves a determinant and a quadratic form which are expensive to compute, i.e. 
the cost is 0(n 3 ) if one uses off-the-shelf methods. Second, the entries of matrix 
T(/) itself, that is, the Fourier integrals cannot be computed reliably using 

standard quadrature methods, because of the many oscillations of the integrand 
for large values of k. Third, Gibbs sampling is generally not feasible for posterior 
distributions associated to this likelihood. One usually resorts to the Metropolis- 
Hastings sampler (see e.g. Robert and Casella 2004 Chap. 7), which is difficult 



to tune in order to obtain reasonable performance. This is quite problematic in 
this context: since the likelihood is expensive, performing several pilot runs in 
order to progressively tune the sampler may be a long and tedious process for the 
user. This problem is compounded if / is specified through a trans-dimensional 
prior; for instance, in the FEXP model above, if k is random. Then one needs 
to implement an algorithm for trans-dimensional sampling spaces, such as Green's 



algorithm (Green 1995 Richardson and Green 19971, also known as the reversible 



jump sampler, which is harder to tune yet. 

Perhaps because of the above difficulties, literature on Bayesian analysis of the 
spectral density of a possibly long memory stationary process is not vast; see |Pai| 
and Ravishanker (1998 20011; Ravishanker and Ray (1997) for approaches in the 



parametric (ARFIMA) ca se, and |Petris| ( |1997[ ); |Liseo et al.| ( |200l] ; |Choudhuri et al. 
fl2004[ ); |Holan et al.|fl2009} for semi-parametric approaches. These approaches often 
rely on some approximation of the likelihood; commonly the Whittle approximation 
is used, although that approximation is not reliable at low frequencies in long range 
settings ( Robinson 1995 ) . Some of these papers establish the consistency of the 



resulting pseudo-posterior, but under quite strict hypotheses: results of |Liseo et al.| 
(20011 hold under the assumption that the true model is a fractional Gaussian 



noise model, while those of Choudhuri et al. (2004) hold under short memory. 



Even under these assumptions, there is obviously some interest for doing exact 
Bayesian inference, rather than doing some approximation, the error of which is 
hard to assess for finite sample sizes. 
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Current Frequentist approaches to estimating of the long memory d parameter 
are not necessarily entirely satisfactory either, albeit for different reasons. A com- 
mon semi-parametric approach in the Frequentist literature is to discard the higher 
frequencies in the periodogram, but this leads potentially to an important loss of 



information; see Moulines and Soulier (2003) for a discussion and an alternative 



approach. Parametric (e.g. ARFIMA) Frequentist approaches on the other hand 
are known to provide unstable and inconsistent results under model misspecific- 
ation. We give at the end of the paper a real data example where our Bayesian 
approach provides quite more satisfactory results than those provided by standard 
Frequentist procedures. 

This paper proposes a novel approach for addressing the above problems in a 
unified manner, and is organised as follows. Section [2] discusses the exact compu- 
tation the likelihood. It is seen that the cost of this operation is 0(n 3 ), but the 
constant in 0(n 3 ) is typically small, hence on modern hardware it remains possible 
to perform a reasonable number of such evaluations provided n is not too large. 
Section [3] proposes a fast approximation of the likelihood, the cost of which is es- 
sentially 0(n). This approximation scheme motivates the following approach. In 
a first step, we perform Monte Carlo simulation of the approximate posterior, that 
is, the prior times the approximate likelihood. The cost of this step is 0(n) but, 
typically, with a large constant in front of n, due to the intensive nature of Monte 
Carlo algorithms. In a second step, which is necessary only if n is not too large, we 
correct for the above approximation by doing importance sampling on a reasonable 
number of simulated samples; the cost of this second step is (9(n 3 ); however this 
time the multiplying constant in front of n 3 is small. In practice the cost of the 
second step is negligible with respect to the first step, at least for n <C 10 4 . (To 
put things into perspective, many real datasets discussed in the literature are such 
that n < 10 3 , and the largest real dataset we could find had n — 4000, see end 
of Section 5 for a discussion.) On the other hand, if n is large, we show that the 
importance sampling step becomes superfluous, because the variance of the import- 
ance sampling weights converge to as n — > +oo, under appropriate assumptions. 
Section [4] discusses a Sequential Monte Carlo algorithm (SMC) for sampling from 
the approximate posterior; see Del Moral et al. ( 2006[ ) for a general introduction 
to SMC. We shall see that the following advantages of SMC (relative to for in- 
stance MCMC) are particularly useful in this specific context. First, as explained 
above, one would like to apply the importance sampling step to a sample as small 
as possible, because the exact likelihood is expensive to compute. We shall see that 
our SMC sampler typically generates particles that are close to IID samples from 
its target distribution (in the sense that the Monte Carlo variance computed over 
repeated SMC runs for certain test functions seems close to the variance that one 
would obtain from IID particles; this point will be discussed more in detail in the 
paper). Second, as also mentioned above, one has little prior information on the 
structure of the (approximate or exact) posterior. It is easy however to make SMC 
adaptive, by learning iteratively features of the target from the sample of simulated 
particles. Third, we observe in certain settings that semi-parametric models such 
as the FEXP model, may generate multi-modal posteriors. The SMC sampler we 
propose is based on tempering ideas, and thus more able to escape from minor local 
modes. 

Section ^ illustrates the proposed approach on simulated and real data. 
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We shall use the following notations: vectors and matrices are always in bold face, 
e.g. x and X. the determinant, transpose, and trace of S are denoted respectively 
|S|, S T and tr(£). 

2. Exact computation of the likelihood 

2.1. Marginalisation. As a preliminary, we note that / is often parametrised in 
such a way that / = a 2 fg, where a is a scale parameter, and 6 is the vector of 
all remaining parameters (except /i). For instance, in the FEXP case, we may set 
a 2 = cxp(£ ), 9 = (Mk)> k = (logit(2d),£i,. . .,&), and 



(2.1) 



-2(/ 



2vr 



00 PO, 5e(A)=cxp 



cos(jA) 



(The function logit is defined as logit(a;) = log(x) — log(l — x), and is used here 
to facilitate the construction of proposal distributions, see Section |4j) It is then 
possible to marginalise out both [i and a 2 from the likelihood, provided these two 
parameters are assigned the following standard prior distribution, independently 
from 6: l/u 2 ~ Gamma(a, b), /i|cr 2 ~ N(m^,a 2 /g^). The marginal likelihood 
reads: 



p(x\9) = J p(x\^,a 2 ,6)p(fi,(T 2 )dfid(j 2 



T(f ) + -E 

9n 



-1/2 



(2.2) 



b+~( X -m„lf [TlJ ol 



-a-n/2 



(x - m M l) 



where E is the n x n matrix filled with ones. Marginalising out parameters usu- 
ally improves the performance of any sampling algorithm. We note however that 
the approach developed in this paper would work with little modification for the 
unmarginalised likelihood p(x\fi, a 2 , 6). These two likelihood functions both suffer 
from the same computational difficulties, which are described in the next section. 

2.2. Computational difficulties associated to likelihood evaluation. We re- 
view in this section the specific difficulties that arise when evaluating either the 
standard likelihood function (1.3), or the marginal version (2.2 1. First, both likeli- 



hood functions include some quadratic form yS 1 y involving the inverse ofanxn 
symmetric matrix S; in flO) , S = T(/), and in pl2] ), S = T(J ) + j-E. Second, 



both functions involve the determinant of the same matrix S. Third, in both cases, 
evaluating the entries of S requires computing simultaneously n Fourier integrals, 



see (1.1). 



Our solution to the third difficulty is described in the two next Sections. Re- 
garding the two first points, the most direct solution is to compute the Cholesky 
lower triangle of S, S = CC T . Then, one obtains the determinant by tak- 
ing the square of the product of the diagonal elements of C, and one computes 
y T Tr 1 y = (C _1 y) C~ l y as the norm of the solution (in z) of the linear system 
Cz = y, which is quickly obtained by back-substitution. The Cholesky decompos- 
ition is a 0(n 3 ) operation. 
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For the sake of completeness, we mention briefly faster, but specialised, al- 
gorithms for solving directly the system Sz = y, in order to compute y T S _1 y. 
Given that S is Toeplitz, one may use Lc Vinson's algorithm (Levinson, 1949 Press 



et al.[ |2007j p. 96), which is 0(n 2 ). Alternatively, |Chen et al.| ( |2006| ) have de 
veloped a variant of the conjugate gradient method, based on a a particular pre- 
conditioned matrix. Their algorithm requires 0(log 3 ^ 2 n) iterations, each involving 
a FFT transform over 2n points. The overall cost is therefore 0(n log 5 / 2 n). 

Unfortunately, these alternative approaches do not provide an evaluation of the 
determinant as a by-product. Chen et al. (20061, Holan et al. ( |2009 1 approximate 
the determinant by using a particular asymptotic approximation, which we describe 
later, but obviously this approach is not entirely satisfactory when used within a 
non-asymptotic, and in particular, a Bayesian, approach. 

Perhaps more importantly, we note that the constant in front of the 0(n 3 ) 
cost of the Choleksy decomposition is typically very small, due to very efficient 
implementation in most scientific software. To give an order of magnitude, for 
n = 10 3 , 1000 of such operations takes about one minute on the first author's 
computer. This observation underpins the strategy laid out in the introduction: 
to run some Monte Carlo algorithm so as to sample from an approximation of 
the posterior, then, provided n is not too large, to correct for the approximation 
using importance sampling on a moderate (possibly sub-sampled from the first step) 
Monte Carlo sample. 



2.3. Computing the Fourier coefficients. In this section and the following, we 
consider the problem of evaluating simultaneously the n Fourier integrals defined 
in ( 1.1 1. As noted in the introduction, using standard quadrature would work very 

The 



poorly, because the integrand in (j 1 . 1 [) strongly oscillates when n gets large. 

a in 

Chap. 13, Sect. 9), 



solution we describe here seems well known in the numerical mathematics literature 
(e.g. Press et al. 2007. Chap. 13, Sect. 9), yet, to the best of our knowledge, it 
has not been used before in the time series literature. Instead, previous approaches 
(e.g. Chen et al. 2006 Holan et al. 2009[) rely o n more specific algorithms such 
as the splitting algorithm of Bertelli and Caporin ( 2002 1 . The approach described 
here has the same computational cost as such alternative approaches, that is, that 
of a FFT (Fast Fourier Transform), i.e. 0(n log n). However, we find our approach 
slightly more convenient, for the following reasons: (a) this is a generic approach, 
which requires only pointwise evaluation of the spectral density, whereas the split- 
ting algorithm requires exact expressions for the moving average coefficients (of the 
short memory part) which are specific to the considered class of spectral densities; 
for instance, it is unclear how one could use these methods if / would be specified 
through splines; (b) it is characterised by only one level of approximation (see 
below), whereas the splitting algorithm expresses first the moving averages coeffi- 
cients as an infinite sum, which must be truncated, then plug these coefficients into 
another infinite sum, which must be truncated again, so assessing the numerical 
error is slightly more delicate; and (c) in long memory settings, the terms of these 
infinite sums are supposed to decay slowly, hence one may need to truncate to a 
large number of terms. 
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Let M be a power of two, M — 2 K , such that M > 2n. We explain first how to 
compute efficiently and simultaneously the Fourier integrals 

7,(0 = g(X)e ux d\ 

for a given bounded function g, and < I < M/2. (If n < M/2, simply discard the 
extra values.) If the spectral density / itself is bounded, that is, if it corresponds to 
a short memory process, then the following method may be used directly by taking 
g = f ■ If / corresponds to a long-memory process, then / diverges at 0, and a 
minor modification is required to use the following method, see next section. 
The idea is to replace g by a linear interpolation g: 

~ / w 1 1 ^ — Am , ;^^o^ / A — Am s 

g( x ) = 2^ 9jtp{ A )+9o¥>o( A )+9mVm{ — ^ — ) 

3=0 

where A = 2ir/M, Xj = —it + jA, gj = g(\j) j = 0, . . . , M, and i/j is the linear 
interpolation kernel, i.e. ip(X) = (1 — |^|) + ; an d y>M are boundary corrections, 
the expression of which may be skipped for the rest of the discussion. Note g is 
defined on the entire real-line, and is zero outside [— 7r, w]. Applying the operator 
/ (•) e ax dX yields: 

/+oo 
g(X)e ax dX 
- OO 

- Eft/ ^(^y lx dx 

3=0 J -°° 



/+°° \ _ \ f+oo \ \ 

M^)e UX dX + g M / VM (^-*™y lx dX 
-oo ^ J — oo ^ 

{M-1 
W(IA) Y, gje ijl v + 9o a {lA)+g M a M (lA) 
3=0 

where W(X) = ^(u)e^ du, a (X) = ^{u)e^ du, a M (X) = Vm (u- 
M)e lXu du. The first sum may be computed using a FFT, in 0(M\og(M)) time 
All the other functions admit close-form expressions, given by |Press et al. ( |2007 
Chap. 13, Sect. 9): 

2(1 — cos A) W 
W(X)= ^ , a = a M = — 

The scheme above may be adapted so as to rely on a cubic (rather than linear) 
interpolation. This point is interesting in settings where the spectral density is 
parametrised in terms of cubic splines. Then, the method described here becomes 
exact. Otherwise, the accuracy of the method is determined by the size of the grid, 



M + 1. We follow Press et al. (20071 and takes M to be the smallest power of two 
such that M > 2n, and we observe in practice that it gives very accurate results 
(in the sense that larger value of M give essentially the same values) . 
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2.4. Computing the Fourier coefficients when / diverges at 0. In this Sec- 
tion, we explain how to adapt the method above for computing Fourier integrals, 
in the situations where / is a long-range dependent spectral density, 

/(A) = i-|l-e- iA r 2 %(A) 

Z7T 

where g is a bounded function, and < d < 1/2. In this case, / diverges at 0, 
hence it may not be well approximated by a piecewise linear function. To address 
this problem, we simply decompose / in two terms: 

2rr/(A) = 1 1 - e^ x f M g(Q) + | 1 - e~ iX | ~ U {g(X) - g (0)} , 

and compute each Fourier integral as a sum of two Fourier integrals corresponding to 
each terms. The Fourier integrals of the first term correspond to the autocovariance 
function of a fractionally integrated noise, which admits a close- form expression 



(Brockwell and Davis 2009 Chap. 13) 



r ( T(l+d)T{l-d) i{l>1 

2 W-J 1 in = o. 

The Fourier integrals of the second term may be computed directly using the 
method of the previous section: assuming g is differentiable at 0, the second term 

vanishes at 0, since |l - e - lX r 2d {g(X) - g(0)} ~ ;?'(0)A 1 - 2d , with 1 - 2d > 0. 



3. Approximated likelihood 



3.1. Principle. For convenience, we consider only the standard likelihood function 
(1.3 1 in this section, and furthermore we assume that /i = 0, i.e. a model with 
zero mean. The main idea of our approximation scheme is to replace T(f)~ 1 by 
T(l/47r 2 /) in the quadratic form, yielding 



p(x\f) = (2tt) 



-n/2 



T(f)\ 



-1/2 



exp 



-Vt(^ 

2 l 47r 2 r 



It is easy to see that the quadratic form within the exponential may now be com- 
puted in 0(n log n) time. We return to this point and other implementation aspects 
in the next section. 

The idea of approximating T(f)~ 1 by T(l/47r 2 /) is related to the asymptotic 
theory of Toeplitz matrices. It is in fact a common technical tool in the asymptotic 
theory of long-memory processes ( Dahlhaus] |1989| |Rousseau et al. 2012), but to 
the best of our knowledge it has not been used for computational reasons before. 

Now assume that / is parametrised in some way, / = fe, and denote p(x\f) = 
p(x\0), p(x\f) = p(x\6). As explained in the introduction, our strategy boils down 
to sample from the approximate posterior n n (6\x) cx p(6)p(x\9) , then to perform 
importance sampling from the approximate posterior to the true posterior, that is, 
to assign some weight to any simulation from the approximated posterior, with a 
weight function defined as: 



A p(x\6) 

= = eX P 



1 1 



p(x\e) 

The following theorem justifies this strategy. 



1 



x 
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Theorem 1. Consider the FEXP model, as defined by (1.2), and let n n denote 
the approximated posterior distribution defined as Tt n {6) oc p(9)p(x\9) , where p{6) 
is some prior density with respect to parameter 9. Then, under certain conditions 
(listed in the Appendix) on the prior distribution p{9) and the true distribution of 
x, with associated spectral density f D , one has 

E*» [w C orr(9)} = W° Corr {l + P (1)} ] 
E 1 " [wCorAOf] = (w° Corr ) 2 {1 + op(l)} , 

where w° Corr = p(x\f )/p(x\f ). 

The proof and the technical conditions on the prior and the true spectral density 
are given in the Appendix. Because this theorem relies heavily on technical results 



of Rousseau et al. (20121, it is restricted to the semi-parametric FEXP model 
presented in the introduction, but with zero mean. We believe it could be extended 
to other classes of models with some extra effort. Note that the true spectral density 
is not assumed to belong to a fixed-dimension FEXP parametric model (for some 
fixed k, in the notations used in (??)), but, with greater generality, to a certain 
Sobolev class of infinite dimension; see the beginning of the Appendix for a more 
precise statement. 

In practical terms, this theorem says that the variance of the weights goes to 
zero as n goes to infinity, or. in other words, that the importance weights become 
nearly constant as n goes to infinity. 

3.2. Practical implementation. In this section, we work out a practical imple- 
mentation of the approximation scheme proposed in the previous section. We now 



turn our attention to the marginalised likelihood defined in (2.2 1. 

First, we simplify the quadratic form by ignoring the uncertainty with respect 
to /i, i.e. by taking m M = x, = +oo, so that the likelihood simplifies to 

f i \ — a+n/2 

I-V2 |\ , l ~Trnf-r ^ 



p(x\d) oc \T(f g )\ n+-x 1 T{f 9 )^x\ , X = X-Xl. 

We observe that this particular approximation is quite accurate in practice, which 
relates to the fact, in long- memory scenarios, x is the standard estimator of /i, and 
typically converges faster than other features (e.g. d) of the model. 

Second, as explained in the previous section, we replace the inverse of T(fg) by 
T(4ir 2 1 fg) which leads to the following approximation of the quadratic form: 



A 2 — 

(3.1) ^T(f g )^x « xT(^)x = c^HU) 

where the coefficients Ck(x) may be pre-computed, once and for all, from the data 
(denoting $4 — Xi — x the components of x): 

£?=i*? if J=0 

2 Yh=i if 3 = l,...,n- 1 



Cj (x) 



and the 'jh(j)' are the coefficients of the Toeplitz matrix T(4ir 2 / f g ), that is, the 
Fourier integrals corresponding to function h = 1/ (47r 2 /). Using the method de- 
scribed in Section 2.3 one obtains a O(nlogn) overall cost for evaluating the quad- 
ratic form. 



COMPUTATIONAL ASPECTS OF BAYESIAN SPECTRAL DENSITY ESTIMATION 9 



Third, we approximate the determinant using a 0(1) asymptotic approximation, 
as explained in the next section. 

3.3. Determinant approximation. Approximations of determinants of the form 
|T(/e)| typically rely on asymptotic expansions. For instance, Whittle's approx- 
imation consists in replacing log |T(/<j)| jn by its limit, 

-log|T(/ e )| f log{2vr/ e (A)} dX. 



Chen et al. (20061 use more refined asymptotic results on Toeplitz matrices (e.g. 
Bottcher and Silbermann 1999 p. 177) to obtain a more accurate approximation. 



Using their approach, one obtains in the FEXP case 

fc fc 

A 



log |T(/)| « D n (fe) = d 2 logn + j£ j£f + d£ ifc + log g ( | _ f 

3=1 3=1 [ ' 



where G is Barnes' function (Adamchik 2001). This approach is easily adapted to 



other time series models, such as ARFIMA; we refer to Chen et al. (2006) for more 
details and possible extensions to other classes of models. 

3.4. Further approximation. The cost of the approximation described in the 
previous section is that of a FFT operation, that is O(nlogn). This cost may be 
further reduced by remarking that the approximation of the quadratic form may 



be rewritten as (see e.g. Palma 2007 Chap. 4) 



^Tif^x a x'T( 



x = 



J(A) 
2^ J-* f e (X) 



dX, /(A) = 



? e ijX 



that is, I(X) is the periodogram of the (centred) dataset x. 

It is relatively easier to evaluate I(X) at the Fourier frequencies Xj = 2wj jn. 
which suggests a further approximation, where this integral is replaced by a Riemann 
sum computed over the Xj : 



/(A) 



2tt J_„ f e (X) 



dX 



Computing simultaneously the I(Xj)'s requires performing a FFT transform. The 
cost is 0(n log n), but this needs to be done only once, for a given dataset. Then the 
approximate likelihood may be evaluated for many different values of 9, at a O(n) 
cost. Since typically about 10 4 — 10 6 such evaluations are performed when Monte 
Carlo sampling from the approximate posterior, one may for practical purposes 
ignore the pre-computation time, and consider this further approximation as a 
0(n) operation. 

To conclude, our final approximation takes the following form: 



(3.2) 



P(x\0) oc DJ0) 



-1/2 



b+ - 



n 



-a—n/2 



This is very close in spirit to Whittle's approximation of the likelihood, which is 
based on the idea that the /(Aj)'s are nearly independent, with variance fg(Xj); see 
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again e.g. Palma (2007 Chap. 4). Rigorously speaking, we have not been able to 
establish that this further approximation is valid in the sense defined in Section [XT] 
that is, that the variance of an importance sampling step from the approximated to 
the true likelihood converges to zero. We merely observe empirically that this ap- 
proximate likelihood is nearly indistinguishable numerically to the more principled 
approximation developed in the previous Sections, that is: 

-o+n/2 

_ -)s ! 

2 y ^ 2 fe J 

On the other hand, we also observe that the speed improvement brought by this fur- 
ther approximation is rather modest , which is in line with the respective theoretical 
costs of 0{n) and 0(n log n). From now on, we do not distinguish between these 



p(x\0) cx D n {6)- 1 ' 2 { b+ -x'T(-^=r-)x 



two approximations. We only note that an additional advantage of (3.2 1 is that 
its gradient is easy to compute, which would make it possible to use Langevin-type 
MCMC moves. 

4. Monte Carlo sampling 

4.1. Background. This section discusses Monte Carlo sampling from the approx- 
imate posterior, that is, 

n n (9) <xp(0)p{x\O) 

where p(6) is some prior density defined with respect to parameter 0, and p(x\0) 
is the approximate likelihood defined in Section [3j Although this discussion is, as 
before, not specific to the FEXP model, we shall assume for notational simplicity 
that the considered model is parametrised as follows: 

= (fc,0 fc )eiWi}xM i+1 . 

For instance, in the FEXP case, we have seen that 6k = (logit(2<f), £i, . . . , 
We shall also assume a nested structure for the 0' k s, that is, 0$ = (6o), 0^+1 = 
(01, 9 k +i) T - Again, in the FEXP case, 9 = logit(2d), 9 3 = ^ for j > 1. Finally, we 
denote Pk(@k) = p(9k\k), the prior of Ok, conditional of k, pk+i(9k+i\0k), the prior 
of component 0f.+i, conditional on the first k components of Ok+i being equal to 
those of vector Ok, and p(x\k, Ok) the approximate likelihood p(x\0) for 6 = (k, Ok). 

A common approach to this problem would be to alternate the steps described 
below as Algorithms [TJ and [2] that is a random walk Metropolis step with respect 
to Ok, conditional on k, and a birth-and-death step, that is, a particular instance 



of Green (1995)'s reversible jump sampler, which attempts at either incrementing 
(birth) or decrementing (death) k. In case a birth is proposed, the current vector 
Ok is completed with a draw from the conditional prior pk+i(0k+i\9k)- In case a 
death step is proposed, component 9k is deleted from the current vector Ok- 

Of course, many variants of this basic MCMC strategy could be considered. 
However, such variants are unlikely to directly address the following two limitations 
of standard MCMC: (a) calibration; and (b) local behaviour. Calibration refers 
to the difficulty of choosing certain tuning parameters, such as that the scales 
Sfe in the random walk step. It is well known that the choice of such tuning 
parameters may have a critical impact on the performance of the algorithm. We 
shall see in our numerical study, Section |5]), that tuning manually the S^'s is rather 
tedious. A more principled approach would be to use some form of adaptive MCMC 



sampling, see Andrieu and Thorns ( 2008[ ) for a review, where one uses past samples 
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Algorithm 1 Gaussian random walk Metropolis step (conditional on k) 
Input: 6 = (k, e k ) 
Output: 8' = (k,6' k ) 

1. Sample 9* k ~ N{O k ,Y, k ). 

2. With probability 1 A r, take 9 k = 9 k . otherwise 6 k = 9 k , with 

r = Pk (9* k )P(x\k,e* k ) ^ 
Pk{dk)P(x\k,e k )' 



Algorithm 2 Birth and death step 
Input: 9 = (k,6 k ) 
Output: 0' = (k',d' k ,) 

Constants: pu^-k+i = p k ^ k -i = 1/2 for k > 1, p ^i = 1- Po^-i = 0. 

1. Let 6* = (k*, 9 k *), where, with probability p k ^ k+ i, k* = k + 1, 

e * k * = (0fcA+i), and 6 k+1 ~ p fe+1 (£ fe+1 |0 fe ) (birth step); otherwise, with 
probability p k ^ k -i = 1 - p k ->k+i, k* = k - 1, 6* k < = (6 , 9 k -i) (death step). 

2. Set 8' — 6* with probability 1 A r, otherwise 0' = 6, with 

p k ^ kP {k*)p(x\e*) 

V = . 

Pk^k*p{k)p{x\0) 



to iteratively adapt the tuning parameters to the target distribution. However, 
adaptive MCMC is less straightforward to use in trans-dimensional settings (as an 
infinite number of tuning parameters Sfe must be learnt), and also does not address 
the second difficulty, which we now comment upon. 

The phrase "local behaviour" refers to the fact that MCMC samplers have diffi- 
culties exploring multi-modal posteriors, as they tend to get trapped in the attrac- 
tion of meaningless modes. One may wonder if multi-modality is an issue for the 
classes of models considered in this paper. 

In our simulations, we observed that the FEXP model may indeed generate 
multimodal posteriors, and we propose the following explanation. Figure |4~l] plots 
a spectral density of FEXP type, see ( |2.1[ ), with d = 0.4, k = 3, and (£i, £2, £3) = 
(1, —1, 1). Overlaid are two other FEXP spectral densities, but with fc = 10 in both 
cases, d = 0.2 (dotted line), d — (dashed line), and coefficients £j fit by least- 
squares on the Fourier frequencies 7rj/100, j = 1, . . . , 100. These three densities 
are hard to distinguish except maybe in the close vicinity of 0. This indicates that, 
unless the sample size is very large, the posterior distribution may be multi-modal, 
with modes that may correspond to values of d which are very far from the true 
value, while k is set to a larger value so as to introduce additional components to 
compensate the bias in d. 



We also refer the readers to Section 5.3 (and in particular Fig. 5.3 1 for numer- 



ical evidence of this type of multimodality in one example, and its impact on the 
performance of MCMC. 

4.2. Sequential Monte Carlo. Algorithm [3] describes a generic SMC sampler 



Compared to the more general framework of Del Moral et al.| ( 2006 ) , this algorithm 
is specialised to the case where the sequence of distributions (r/ t ) is defined on a 
common sampling space, 0, and also to a specific choice of the backward kernel, see 
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FIGURE 4.1. Graphical illustration of a source of multimodality 
for the FEXP model: FEXP log spectral densities, with k = 3, 
d = 0.4, (£i, £21 £3) = (1, — 1, 1) (solid line), compared to k = 10, 
d = (dashed line), and d = 0.2 (dotted line), and coefficients £j 
adjusted by least-squares estimation. 



Algorithm 3 Generic SMC sampler 
All operations are for j = 1, . . . , N. 

(1) Sample 6M - r] (6). 

(2) For t = l,...,T, do: 

(a) Reweighing step: Compute and normalise weights 



(b) Resampling step: sample J from the multinomial distribution that 
assigns probability W\ to value L t , I = 1, . . . ,N. 

(c) Move step: regenerate J as 

e 3 ~ K t {d j ,do) 

where K t is a Markovian kernel that leaves rj t invariant. 



Del Moral et al. ( 2006 1 for more details. The algorithm depends on the specification 
of a sequence of distributions (rj t ) and a sequence of Markovian kernels (K t ). The 
former must be such that it is easy to sample from 770, and that t]t — TT n . the target 
distribution (the approximate posterior, as defined in the previous section). The 
latter must be such that K t leaves invariant rj t , and is typically a MCMC kernel. 
For (rj t ), we take a particular annealing sequence (Neal. 20011, that is, a geo- 



metric bridge between the prior and the approximate posterior 

(4.1) m(o)K P {0){p{x\e)}"' t 

with 70 = < . . . < 7t = 1. The weight function is then: 

wtifi) =p{x\0) a \ a t = ( 7t - 7 t _ x ). 
We shall briefly discuss in the conclusion an alternative choice for (%), based on 



the IBIS strategy of Chopin (20021, which may be useful in sequential estimation 
scenarios. 
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We follow Jasra et al. (2011 ), Schafer and Chopin (2011 1, and adjust dynamically 



the annealing coefficients j t by solving at iteration t, with respect to variable a t , 
the equation 



£jlx«*(*: 



2 

- =cN 



for some fixed c G (0, 1); in our simulations, we took the default value c = 1/2, 



and we used Brent's root-find algorithm (Press et al. 2007 Section 9.3) to solve 



numerically this equation. The left hand side is the effective sample size of |Kong| 
et al. ( 1994| ; it takes values in [l,iV], and is a convenient measure of the weight 



degeneracy. 

For (Kt), we use M steps of the MCMC sampler described in the previous 
section, that is, we repeat M times the following sequence: Algorithm [l] (random 
walk Metropolis), then Algorithm^ (birth-and-death). A big advantage of the SMC 
framework is that we can use the current particle system to calibrate these MCMC 
steps. Specifically, before each move step we set E; = tiSi, tj = 2.38 2 / (Z + 1) , where 

Si is the covariance matrix of those resampled particles 9 W = (k ( J\9 { ^ such 

that fctf) = I; in case this set is empty, we take instead Si = J{. This particular 
choice is motivated by the theory on optimal scaling of Hastings-Metropolis kernel, 
as developed in |Roberts and Rosenthal (20011. 



The only tuning parameters of this algorithm are the number of particles N, and 
the number of MCMC steps performed at each move step, M. Regarding the choice 
of M, we observed the following interesting phenomenon. Increasing M from 1 to 
some small integer (say in the range 5 — 20) often led to a dramatic improvement 
(as compared to increasing N, relative to the same CPU cost). In particular, one 
observes cases where particles are close to IID (in the sense that the empirical 
variance of certain particle estimates over repeated runs is close to the variance of 
the corresponding estimate for IID particles). At this point, increasing further M 
seems to bring no improvement. This point is illustrated in our numerical study, 
see Section [5] This phenomenon seems related to a recent result (obtained in a 
slightly different context) by Dubarry and Douc ( |2011 1, who establish a central 



limit theorem, with asymptotic variance equal to the variance under the target 
distribution (hence the same as for IID particles), by taking M = \og(N). This 
formal result seems to suggest that one does not need to take M very large to obtain 
nearly IID particles. At any rate, this point surely deserves more investigation in 
other contexts. 

Again, one may propose many variants of this SMC sampler. For instance, 
one could also use adaptive reversible jump steps, where the jump probabilities 
Pk-tk+ii Pk-tk—i (currently set to 1/2) could be optimised using the particle sample, 
or even design more elaborate proposals based on a family of linear transforms, 
as in Green (2003), where the corresponding matrices may also be learnt from 
the particle sample. (We did some experiments with this strategy, but did not 
obtain significantly better results in the examples we looked at.) However, our 
main focus here is to develop a black box algorithm, which requires as little input 
from the user as possible, while giving reasonable performance, even in the presence 
of multimodality. Our numerical experiments seem to indicate this is the case, see 
Section [5] 
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5. Numerical experiments 



5.1. Settings. The first part of this numerical study focus on the following sim- 
ulated example. We sampled a long time series from an ARFIMA(l,d, 1) model, 
with d = 0.45, and AR (resp. MA) coefficient —0.9 (resp. —0.2), and applied the 
FEXP model described in the Introduction. We find this example to be challen- 
ging, because the spectral density of the simulated process has a particular shape 
which is not easily approximated by a FEXP spectral density, at least for small 
values of fc; see e.g. Fig. 5.4 and additional comments around this Figure. (From 



a modelling perspective, the presence of an autoregressive root close to one makes 
it difficult to determine whether the persistence in the data is really character- 



istic of long range behaviour.) Thus, and as described more properly in Rousseau 



et al. (20121, the FEXP model described in the introduction must be understood 



as a semi-parametric procedure, which makes it possible to consistently estimate 
the true spectral density (and related quantities, e.g. the long-range coefficient d), 
provided this spectral density belongs to some infinite-dimensional class of func- 
tions (of a certain regularity). In particular, one expects the posterior of k to shift 
towards infinity as n goes to infinity, which adds to the computational difficulty. 
The second part of the study considers a real dataset, see Section [BTBj 

Except in the consistency study (Section 5.4 1, the dataset consists of the n = 10 4 
first values of the simulated time-series. In the consistency study, different sample 
sizes are compared, by again taking the n first points for different values of n. 
Except in the prior sensitivity section (Section 5.5 1, we always take the follow- 
ing prior: independently, d ~ Uniform [0, 1/2], k ~ Geometric(l/5) (with sup- 
port {0,1,2,...}), £j ~ AT(0, 100j- 2/3 ), n\a ~ AT(0, (T 2 /^), 1/cr 2 ~ Gamma(a,6), 
with hyper-parameters j3 = 1, a = b = 0.5, = 0.1. Note that this prior falls 
slightly outside the class of prior distributions that ensures consistency, accord- 



ing to Rousseau et al. (20121. In fact, we found it interesting to see whether the 



conditions to ensure consistency of Rousseau et al. (20121 may be too strong; our 



numerical study seems to indicate it may indeed be the case. 

All the simulations presented in this Section were performed on a standard 3GHz 
desktop computer, without resort to any form of parallelisation, and implemented 
in the Python language (using the NumPy and SciPy libraries); the programs are 
available on the first author's web-page. 



5.2. Performance of SMC. This section is concerned with the performance of 
the SMC sampler, regarding sampling the approximate posterior. Note that the 
final correction step (i.e. the importance sampling step from the approximate to 
the true posterior) is therefore not performed; see Section 5.4 for an evaluation of 
the performance of the correction step. 

We run the SMC sampler 10 times, for N = 1000 particles, and M = 5 MCMC 
steps; CPU time is 20 minutes per run. Variability over these ten runs is small 
for posterior expectations of the £j's (not shown) and d (see top left plot of Figure 
5.1 1, but is a bit larger for the posterior distribution of k (see same plot). The left 
plot of Figure [572] reveals that the acceptance rate of the birth and death step is is 
often below 10%, which suggests M = 5 steps may not be enough to successfully 
move the particles with respect to component k. In that plot, the x— axis gives the 
annealing coefficient; that is j t at iteration t, which goes from 70 = to 7^ = 1, 
see (4.1 1. Note that this axis is a non- linear function of elapsed CPU time, as the 
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FIGURE 5.1. Top row: box-plots, over 10 repeated runs, of the 
SMC estimates of the posterior expectation of d and the posterior 
probabilities of k = 9, . . . , 14 (Left: (N,M) = (10 3 ,5); Right: 
(N,M) — (10 3 ,20)); Bottom Left: marginal posterior of d, as 
estimated by superimposed (with transparency effects) weighted 
histograms obtained by 3 first SMC runs; Bottom Right: hexagonal 
binning plot of k versus d, from the first SMC run; a N(0,0.1 2 ) 
jitter is added to k, and colour is proportional to the sum of the 
weights of the particles falling in each hexagon. 



sequence 74 grows slowly at the beginning, and progressively accelerates, whereas 
the cost per iteration t should be roughly constant. 

We run again the SMC sampler with N — 1000 particles, but this time with M = 
20 steps; CPU time is then 1 h our 20 minutes. We obtain very satisfactory results; 



see the box-plots in Figure 5.1 In particular, regarding the posterior expectation d 
(with respect to the approximated posterior) of component d, we observe that the 
empirical variance of the estimates of d obtained from the ten runs is very close 
to Var ir "(d), that is, the variance would be obtained from N = 1000 independent 
and identically distributed simulations from the posterior. Figure |5.1| also reports 
superimposed weighted histograms obtained from the second set of runs, and a 
particle approximation of the bivariate posterior distribution of (k, d). (Some noise 
was added to k for the sake of visualisation.) 

5.3. Performance of MCMC. This section compares the performance of our 
SMC sampler with standard MCMC. As the previous section, this comparison is 
in terms of sampling from the approximate posterior, and no correction step is 
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-. (tempering coeff.) 




FIGURE 5.2. Left: acceptance rates of the random walk step 
(circles) and the birth and death step (squares) versus the tem- 
pering coefficient j t , which progresses from to 1 during one run 
of the algorithm (N — 1000, M — 5). Right: CPU times in seconds 
for the SMC sampler targeting the approximate posterior (squares) 
and the correction step (stars) versus the sample size of the data. 



performed at the end of the algorithm. The MCMC sampler we use is the one 
described in Section |4~T1 

We simulate several MCMC chains with different tuning parameters, but always 
of size N = 1.6 X 10 5 , which gives a running time per chain of about 50 minutes 
(as compared to 20 minutes for the first sequence of SMC runs) . 

We start by 10 runs, with the scale of the random walk in the <?fc-space, set 
to rlk+i (t times the identity matrix of rank k + 1); after some pilot runs, we take 
r = 0.015 so as to obtain an acceptance rate for the random walk step close to 25%. 

One chain seems to alternate between two local modes, one around (k, d) = 
(20, 10~ 3 ), the other around (k,d) = (30, 10~ 3 ); see first row of Figure [5^1 The 
nine other chains do not get trapped in these local modes; however their mixing 
properties seem quite poor; see the MCMC traces for d the middle row plot. Note 
how the darkest trace never seems to visit the region d > 0.45, which accounts for 
20% of the posterior mass, according to our SMC runs. 

These results are hardly satisfactory. We also experimented with set to r 
times the prior covariance matrix of Ok, but this did not give better results (not 
shown) . Finally, we implemented a (crudely) adaptive version of our random walk 
Metropolis sampler, for the posterior conditional on k = 10 (no reversible jump 
steps), where, during a burn-in period of 5 x 10 4 iterations, the covariance matrix of 
the random walk proposal is re-calibrated every 10 3 iterations, using past samples ; 
specifically Ej, is set to (2.38 2 /(fc + l))S t (see Roberts and Rosenthal 20011, 
where St is the empirical covariance matrix computed on the t first iterations. 
One sees that learning the correlation structure of the posterior (conditional on 
k) significantly improves the mixing of the chain, although the auto-correlation 
function (with respect to d, and computed post burn-in) decays slowly, see bottom 
row of |5.3| However, learning the posterior covariance matrix, conditional on fc, for 
each A; in a given range, would of course be wasteful. 

5.4. Consistency study. In this section, we show how both the approximate and 
the true posterior evolve as the sample size grows. The goal is to assess both the 
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FIGURE 5.3. Top row: one MCMC run alternating between two 
minor modes (Left: trace of k, Right: scatter-plot of (k, d), with 



a jitter term added to k as in Fig. 5.1|. Middle row: traces of 
d for five out of the nine remaining MCMC chains. Bottom row: 
adaptive MCMC; trace of d (left) and ACF plot (right), computed 
post burn-in. 



asymptotic properties of our FEXP semi-parametric procedure, and the effect of 
the correction step on inference. Results are obtained from a single SMC run. 

Figures |5.4| and |5.5| summarise posterior inference for the sample sizes, from 
top to bottom of Figure [B~4] n = 100, 300, 1000, 3000, and, from top to bottom 



of Figure 5.5 n — 10 4 and n = 10 5 . On the left (resp. right) side, the marginal 
posterior distributions of d (resp. the 80% confidence bands for the spectral density) 
are compared; light grey is the approximate posterior, and dark grey is the true 
posterior. (Transparency effects are used.) One sees that the effect of the correction 
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step becomes unnoticeable very quickly as far as the estimation of the spectral 
density (away from 0) is concerned. The effect on the marginal posterior of d takes 
a bit more time to vanish, but is already negligible for n around 3000; in fact, 
the effective sample size of the correction step is already above 900 (out of 1000 
particles) for n = 3000. 

The right panel of Figure |5.2| compares the CPU cost of the SMC sampler and 
the correction step. One sees that, unless n is of order of 10 4 or more, the overall 
cost of the complete procedure is dominated by that of the SMC sampler. For 
n 10 4 , the correction step becomes quickly too expensive, because of the 0(n 3 ) 
cost, and was not performed for n — 10 5 . (Which is why the dark grey plots are 



missing in the bottom row of Figure 5.5 ) Fortunately, one sees that for n ^ 10 



14 



the correction step seems to be superfluous. 

5.5. Prior sensitivity analysis. We have seen in Section [ST] that the prior for the 
FEXP coefficients £j was set to N(0, lOOj" 2 ' 9 ), with /3 — 1; this hyper-parameter is 
related to assumptions on the smoothness of the true spectral density (the larger /?, 



the smoother the true spectral density); see Rousseau et al. (20121. In this section, 
we consider briefly the effect of j3 on posterior inference. 

Figure [576] gives the same types of posterior plots as in the previous sections, i.e. 
80% confidence bands for the spectral density, and marginal posterior for the vector 
(k, d), for = (Left side) and j3 — 2 (right side). This must be compared with the 
top right plot of Figure [575] and the bottom right plot of |5.1| for which j3 = 1. One 
sees that the choice of f3 has a strong impact on the posterior marginal distribution 
of k, but a rather moderate impact on either d or the spectral density itself, except 
maybe on the right edge of the spectral density plots (for A close to tt). Since k is 
essentially a nuisance parameter, one sees that the choice of (3 does not seem to be 
too critical for inferential purposes. On the other hand, it is interesting to note the 
impact of j3 on the computational difficulty to explore the posterior. We observe 
that, for /? = 2, it is even more difficult for a MCMC sampler to escape local modes 



such as those shown in the top row of Figure 5.3 (corresponding MCMC traces not 
shown) . 

5.6. Real data study: Ethernet Traffic. In a preliminary study, we applied 
our methodology to the popular Nile data set, but found the example not to be 
very challenging, either from a computational point of view (n = 663), or from an 
inferential point of view (an ARFIMA(0, d, 0) model, or in other words, a fractional 
Gaussian noise model, seems to fit the data well). More details may be obtained 
from the first author. It is perhaps interesting to note that real datasets considered 



Pai and Ravishanker ( 1998 ) 
= 318, the sunspot dataset of Choudhuri et al. ( 2004[ ), n = 288, and so on. 



in the literature are rarely larger; see e.g. the US GNP data of Koop et al. (19971, 
n = 172, the central England temperature data of 
n 



This remark seems to give additional support to two aspects of our work: (a) to 
perform exact Bayesian (rather than asymptotics-based) inference; and (b) that 
the correction step, although 0(n 3 ), is typically negligible in practical examples, as 
shown in Fig. |5.2| 

In this paper, we consider instead the Ethernet Traffic dataset of Lel and et al. 



(1994), which can be found in the longmemo R package. This is a time-series of 
length n = 4000, which records the number of packets passing through a particular 
network per time unit. (For convenience, we divided the data values by 1000, in 
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Figure 5.4. Consistency study for the ARFIMA data, Left: mar- 
ginal posterior of d, approximate (light grey histogram), or exact 
(dark grey histogram); Right: 80% confidence bands for the log- 
spectral density (same color code), true spectral density (dashed 
line). From top to bottom, sample size is 100, 300, 1000, 3000. 
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FIGURE 5.5 . Co nsistency study for the ARFIMA data, same cap- 
tion as Fig. 5.4 sample size is 10 4 (top) and 10 5 (bottom). 



order to use the same prior as for the simulated dataset.) The right side of Fig. 5.7 
plots the empirical spectrum of this time series. Although the empirical spectrum 
is an asymptotically biased estimator of the true spectral density under long-range 
dependence, the bias is typically small for frequencies sufficiently far away from 



(e.g. Moulines and Soulier 2003). Thus comparing the empirical spectrum with 



a given estimator of the spectral density provides at least some guidance on the 
performance of the said estimator. 

Interestingly, we find these data to be quite challenging for frequentist parametric 
procedures. The FEXPest command of the longmemo R package, which computes 
Beran ( 1993 1's estimator for a parametric FEXP model, either returns the estimate 
d = 0.52 - although d is supposed to be in (0, 1/2) - or d = 0.222 or d = 0.316, 
according to a tuning parameter which regulates the polynomial order selection. 
The fracdiff command from the fracdiff package, which does maximum like- 
lihood estimation for ARFIMA(p, q) models, returns d — 0.3 for p = q = 3 but 
the estimated spectral density, plotted as a dashed line in the right panel of Figure 
5.7 does not seem to fit the spectrum of the data even for higher frequencies. (Ac- 
cording to the same procedure, the orders p — q = 3 give the smallest AIC among 
ARFIMA(p,p) models, with p < 10). 

In contrast, our Bayesian semi-parametric procedure seems to fit the data rather 
well; see again the right side of Figure |5.7| We find strong evidence in favour of 
long-range dependence, as evidenced by the marginal posterior distribution of d in 
the left panel of Figure |5.7| 
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FIGURE 5.6. Prior sensitivity analysis: 80% posterior confidence 
bands for log-spectral density (top), hexagonal bin plot of (k + 
jitter, d), where jitter is N(0, 0.1 2 ) noise (bottom), for two different 
priors: p = 0, i.e. £j - N(Q, 100) (left); and = 2, i.e. ~ 
7V(0,100j^ 4 ) (right). 



From a computational point of view, we mention briefly that we observe the same 
things as in the previous section; that is, that the correction step has a negligible 
impact on the results, and that the output SMC sampler shows little variability 
when run several times. 



6. Conclusion 

Several straightforward extensions of this work could be considered. First, as said 
in the introduction, the approach we proposed could be adapted with little effort to 
other, parametric or semi-parametric, prior distributions for the spectral density /. 
Second, as pointed out by both referees, one may consider the problem of sequential 
estimation, that is, to compute sequentially the sequence of posterior distributions 
of the t first data-points, where t is incremented by one at each iteration. Of course, 
if a long-range dependent model is considered, then by construction the posterior 
distribution may not provide relevant information on / if t is too small, so perhaps 
sequential estimation should be started only after some initial sample of size no 
has been processed. In that case, one could use a variant of the IBIS algorithm of 



(Chopin 2002 ), that is, a SMC sampler based on the sequence of posterior distribu- 
tions based on t data-points, t = n , . . . , n, but with the likelihood replaced by the 
approximation presented in Section[3] This would lead to a 0(n 2 log(rc)) algorithm, 
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FIGURE 5.7. Ethernet data, Left: marginal posterior distribution 
of d from the FEXP semi-parametric model (as represented by 
a weighted histogram obtained from the SMC sampler); Right: 
Bayesian 80% confidence band for the spectral density (dark grey) , 
spectral density corresponding to ML estimation of an ARFIMA 
model, with orders p = q = 3 (dashed line). 



where n is the total number of observations, because computing the approximate 
likelihood at iteration t would cost 0(t\og(t)). (For the same reasons, using exact 
likelihoods would give a 0(n 4 ) algorithm, which would be far too expensive.) Thus, 
this approach would be useful only for sequential estimation, and not when one is 
interested only in the posterior over the complete data-set, as one may use instead 
in that case the SMC sampler described in this paper, which is 0(n\og(n)). 
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Appendix: Proof of Theorem 1 



The proof borrows several intermediate results from Rousseau et al. (2012 1. Re- 
garding the prior distribution, we make the following assumptions: 

• the prior p{6) decomposes as p{d)p(k)pk(£, k ) , where the support of p{d) is 
[0,1/2-t] for some t > 0; 

• the prior for K 7 p(k), is a Geometric distribution; 



• Conditional on K = k, the coefficients £j's in (2.1 1 have a Gaussian distri- 
bution 

o(j + ir +1/2 ~AA(o,i) 

and they are independent except for the fact that the vector ^ must belong 
to the set 

k 

E k (a, L) = e R** 1 ; Ytf + lf a+i e 3 < L} 
where L is a large positive constant. 
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Regarding the true distribution of x, which is denoted by P™ from now on, we 
assume that x ~ N(0 x l,T(/ )), where the true spectral density f a admits an 
infinite FEXP representation, that is, a FEXP representation as defined by (1.2 1, 
but with parameters d Q > 0, k Q = +00, and £0,0, £1.0, ■ • ■ belonging to the Sobolev 
ball of radius L, E(/3, L) = {(&, fi, • ■ ■); Y^LqU + l ) 2P+1 Q < L }- Throughout the 
proof we denote by |A|p> = frfA^-A] 1 / 2 the Froebinius norm of a matrix A and by 
||j4|| 2 = svp{x T A T Ax; \\x\\ = 1} its euclidian norm. 

We now prove that, under the above conditions (on the prior distribution p(9). 
and on the true distribution P™ of x) , and provided < d a < 1/2 — t, f3 > a > 3/2, 
and for L Q small enough (compared to L), one has: 

E*" [w 2 (0)] =wl{l + o p (l)}, w 2 o = 0p (e^ l °^ 3/2 ) 

for any sequence (u n ) such that u n — > +00, and where w (resp. w ) is used as a 
short-hand for Wcoti (resp. WQ orl ) in the rest of the proof. One may obtain the 
same type of results for E 71 ™ using the same calculations. 

One has: 

fw 2 (6) ex V {-lx T T(l/4n 2 f e )x - Z„, } |T(/ e )p 1/2 p(8) dO 



E ff " [w 2 (0)] 



/exp {-\x T T(l/4x*f )x - l n ,o] \T(f e )\- 1/2 p(0) dO 



Nn 

D n 



where Z„, = - {x T T{f )- 1 x + logdetT(/ )} /2. 

Let e 2 = (n/\ogn)~ l3 /( 2 P +1 * > . The idea of the proof is to first show that 



(6.1) 



E^" 


'w 2 (ey 











w 2 {6) 



n Snl (6)n k < knl 



o P (l) 



where S n ,\ = {0\l{f o ,fe) < k n ,i = k Q en 1/a for some k a > 0, and l(f,f) is 
the L 2 distance between spectral log-densities, l(f,f) = {log (///')} • Let 
S n ,i,k — {9k'i 9 — {k, 6k) € S n ,i}- In a second step, we show that 

(6.2) sup I log w{8) - log w \ = o p (l) 

uniformly on fc < fc„ 1, which would then conclude the proof of Theorem 1. 



( 


2012 


1 and 


Kruijer and 



terior 7r„ follows the same lines), there exists c > such that 



P" 



D n > e 



= 0(1). 



(6.3) 

Indeed, let k n = [A: (n/logn) 1/(2 ' 9+1) J with k > and 

B n = {0; k = fc„,d < d < rf Q + n~ a , |0 - Sj, \ < n- a ,j = 1 
then it is easy to see that, for some c\ > 0, 

P P(e) [Bn] > e - Clfc " lo S™, 

where P^ 61 ) denotes the prior probability. As in 
(poTl), let n n {9) = {x;l n {9) - I, 

l n (9) = - l -x T T{\/^ 2 f e )x - i logdet [T(/ e )] 



k n } j 



and Rousseau 



Rousseau et al. 



(20121 and Kruijer 



-ne 2 }, with 
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then we have that for all 6 <E B n , 

(6.4) p: {n n (ey} = o(i) 

if a is chosen large enough. Indeed let A{6) = T 1 / 2 (/ )T(l/4vr 2 /e)T 1 / 2 (/ ) - I n 
and B{6) = T 1 / 2 (/ )T(,f e )- 1 T 1 / 2 (/ ) - I n . Using Lemma 2.4 in the supplement of 
|Kruijer and Rousseau| ( [201l] , for all € B n , we have that tr [T~ 1 l 2 {f )T{f g )T- 1 / 2 {f )\ 
2n for n large enough so that all eigenvalues of T 1 / 2 (/ )T(/e) _1 T 1 / 2 (/ ) are 
bounded from below by 1/2 and 

log \I n +B(8)\ = tr [B(6)]~ix [((/„ + rB^))- 1 B(8)) 2 ] > tr [fl(0)]-2tr [B(9) 2 ] . 



Moreover using Lemma 2.4 in the supplement of |Kruijer and Rousseau| ( |2011[ ), we 
have that tr [-£?(<?)] = tr +0(n e ) for all e > so that, since for n large enough 

and 6 G B n , 2tr [B(0) 2 ] + n e < ne 2 n , 



P o n [O«(0) c ] < 



One also has: 



n 

2^ 



[z T A{e)z - trL4(0)] > ne 2 n ] 



1 )d\ + Error, 



where the Error is controlled by Lemma 2.6 in the supplement of Kruij er" and| 



Rousseau (2011), since we have 



\A(6)\ 



Fr 



= tr 



{T(f^)T(f -f e )Y 



and f — fob where (taking ^ = for j > k) 

{oo 
J2(€o,j -&)cosjA 
3=0 

and we note that, for 8 G the function exp |X)j=o £7 cos j'a| is Lipschitz with 

constant 0(kn^ 2 ~^ + ) = O(l), for a is large enough (see Lemma 3.1 in the supple- 
ment of Kruijer and Rousseau| ( 2011[ )). Therefore we obtain 

Error = 0(\ogn\\b - 1|U) = O(logrc), 

and 1^4(0)1^,. = 0(n 1 ~ a + logro) = o(n T ne 2 ), for some r > 0, which combined with 
Lemma 1.3 in the supplement of |Kruijer and Rousseau (2011), leads to, for some 
c> 0, 

C R,(0) c ] <e~ cn \ 

We now turn to the second part of the proof. Let S n .j,k — {8k',l(fo, f(k,e k )) G 
((j - l)en, je«)}, for j = !,-■■ , J n where J„ = 0{e~ 1 ), and let 



AT, 



n,j,k 



W / 



(9) 2 exp{l n (8)-l o , n }p(0)d0 
exr> U„(9) — L. „,\ . _. 
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We prove that uniformly in Jq < j < J n and k 6 N, iVnj'.fc — °p( e_cnc ")i f° r some 
Jo > 0. To do so we bound 



sup \\og(w /w{8))\ > V n ,j,k 
ees„, 3 , fc 



for j > 2 and for a properly chosen v 



n,j,k- 



Let k n 



Jin Qe (log n) a e n 1 , with Ji > 0. Define for C > (large enough) and r > 
(small enough) 

(6.5) 

v n ,j,k = k 3/2 n e je n , if k<k Ut j, j>l 

V n:jik = (C\0gn(k A „(l+ £ )/(a-l/2)fc-l/(a-l/2)) + k^k'^ 2 , if k > k nJ . 

Note that if e is small. Cfclogn = o(n e ) for some k > k n j only if (jen) _1 ^ Q ;$ 



n £ (logn) i.e. if j > n Qe (logn) 



a c -\ .. 



We write — \ogw{&) + log(£ ) 



z'A{8)z/2 with z = T- 1 ' 2 {f )x ~ 7V(0, I n ), and, under P , and we bound success- 
ively 

,Tr mo\ A/a M_T 



and 



sup (* T L4(0) - A(0 d )];z T + tr[A(0) - A{0 d )]) 



sup z T L4 D - A(6 d )]z + ti[A - A(0 d )}\ 

d 



with d = (d, k, £d,k) and = argmin£ eRfc+: J(/o, /d.fe.e), see Kruijer and Rousseau 
fl2011| ). We now study 

sup (z'[A(8) - A{8 d )]z + ti[A(0) - A(0 d )}) . 

ees„, 3 , fc 

Note that n^ 1 \A(6) - A{0 d )\ 2 Fr = n~Hx [(A(0) - A{0 d )) 2 ] converges towards 0, so 
we only need control the approximation error, which we split into the approximation 
error of tr [(T(/ )T(/ e 1 - and of tr [(T(f a )(T(fe)~ 1 - T(^)- 1 ) 2 )] . We 

now consider the first term. Note that 

k 

fe 1 - f§! = /^(expEfe " (Uk)j)cos(jX)} - 1) := /=">(A) 



where 



sup \b e (\)\ < 10 - {Cd,k)j\ <VkU- | d ,fc|| < Vkje n . 

Ae[-7r,7rl 



3=0 



Note that if k > k n j then k 1 / 2 a < Vkje n and we bound instead 



sup ^(a;)] < J2 10 - (&,k)il ^ V^jU ~ id,k\\ + k 



-a+1/2 
3 



3=0 



Lemma 2.1 of the supplement of Kruijer and Rousseau| (2011) implies that the 
approximation error of the first term is bounded by 0((Y2j=i 10 ~ i^d.k) j\n e ) 2 ) — 
0{{n £ 'Vkje n ) 2 ) for all e' > and all k < k nj and is bounded by <3((£;~" +1/ V') 2 ) 



for all k > k n j and all e' > 0. From Lemma 2.4 of the supplement of Kruijer and 
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Rousseau (20111, the same bounds apply to the second term. Finally we obtain 



that 

\A{6) - A{9 d )\ Fr = 0(n e Vkje n ) if k < k nd 
\A(6) - A(0 d )\ Fr = 0(rfk- a + 1/2 ) if k > k n<j . 

This implies that for all £ £ S n j.k ■ 

• If k < k Hl j, since Vn^.kiVkjtn)' 1 = kn e , 

P(z T [A(e)-A(9 d )]z+tr[A(9)-A(e d )} > v nd , k ) < e~ ckn \ 

• If k > k n ,j, since v njik k"~ 1/2 > n e , 

Pr (z T [A(p)-A(O d )]z+ix[A(0)-A(O d )]>v nd , k ) < e"™— kk ^'\ 
Moreover a Taylor expansion of around A(0) implies that V<5 > and for all 

(6.6) 

k 

\z T [A(e)-A(O')}z+tr[A(0)-A(0')}\ < (z T z+n)n^-^+ +5 [Y^ \^-^\ + \d-^\]. 

3=0 

Without loss of generality we can restrict ourselves on fi ra = {z T z < 2n}, since 
P[£lfJ = o(e~ cn ) for some positive c. 

• If k < k n ,j and j < J n<1 , if ||£ - < n- 1 -^- 1 ' 2 and \d - d'\ < n" 1 " 6 , 

\z'[A(0) - A{0')]z + tr[A(6) - A{6')]\ = o(l), uniformly. 

If k < k n ,j and j > J n ,i, then 

\z'[A{e) - A{0')]z + tr[A(9) - A{9')]\ = o{ne 2 J 2 ), uniformly. 

Let E n ,j,k be the covering number of S n j tk by balls satisfying the above constraint 
then 

P . , < pC'fclogn 

• If k > k n .j. First if k > k2'ne 2 l , for some &2 > possibly large, then uniformly 
over £*=i |^ - ^| < rT l - e k and \d-d'\< n" 1 " 6 , 

\z T [A{0) - A(0')]z + ix[A(0) - A{8')]\ = o(k) 

The covering number of S n j^k by the above constraints depends on k. Define K n (k) 
such that K n (k)-( a -W = n" 1 "^, i.e. K n (k) = n( 1 + £ )/( a - 1 /2)jfe-i/(a-i/2) < xhcn 

E n ,j,k < exp(Clogn(fc A K n {k))) 

Now if k < k2ne 2 l , then uniformly over Ylj=i — £j\ — i _1_£ (nj 2 4) an d \d— d'\ < 

\z T [A(8) - A(8')]z + tx[A(9) - A(6')]\ = o(nj 2 e 2 n ) 

and 

E n ,j,k < exp(Cfclogn). 
A simple chaining argument in <SVij,fc implies that for all j < J n ^\ and all k < k n j, 
there exists M > such that 
(6.7) 



0„ n { sup (z T [A{6) - A(0 d )]z + tr[A(0) - A(9 d )}) > v ndtk + M} 



< e~ ckn ' 
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and if j > J n \, for any S > and n large enough 
(6.8) ' 

n n n { sup (z T L4(0) - A(O d )]z + ix[A(0) - A(0 d )]) > v ndlk + Snj 2 e 2 n } 

Note also that for all k < k n j, v n j. k — fc 3 / 2 n e je„ < 7i c (je n ) 1_3 /( 2a ' so that 
whenever a > 3/2 and j < J n ,i, Vn,j,k = o(l) and if j > J n ,\, v n j. k — o(nj 2 e 2 ). 
If k > k n j and k > A^ne 2 , for &U j, an d all 5 > 



— ckr 



(6.9) P 



sup (z T [A(0) - A(0 d )]z + tr[A(0) - A(0 d )]) > v„, j<k + Sk 



< e" 



If fc < fc^ne 2 , for all 6 > and n large enough. 
(6.10) 

fl n n { sup (z T [A{6) - A(9 d )]z + tr[A(0) - A{0 d )]) > v nd<k + Snj 2 e 2 n } 

< g-c(fc log n+n e ) 

and for all k > k n j, v n j.fc = o(k + nj 2 e 2 ). We now study 

sup \z T [A(6 ) - A(§ d )]z + tr[A(0 o ) - A{6 d )} I 

d 

We have 

|A(0 o )-A(0 d )|| r =O + Error, 

where Error is the approximation error of the trace by its limiting integral. We 

and of 



{T{f )T{f^ - f^)f 



bound separately the approximation error of tr 

tr [(T(/ )(T(/ )- 1 -T(/eJ- 1 ) 2 )]. We have, see |Kruijer and Rousseau| ( |201l| , 

fe d = f e( d - d °) H *- A «o,>°, H k (x) =J2^ cos (^)' ^ = 2 H Ad °* = Y, 6 °* cos ^ 

j>k j>k 

so that 

fe d - fo = fo((d-d )H k - A do , fe ) + 0(((d-d )H k - A^) 2 ^ 2 ^!) 
Therefore, from Lemma 2.6 in Kruijer and Rousseau (20111 in the supplement, 
(6.11) 

error (tr [{T{f Q )T{f^ - f^)) 2 ]) 

= 0((\d - d D \ + HA^fcHo,) 2 logn + (|d - 4| + ||A de)fc || 0O )l(/ O) fe d ) 1/2 (\ognf). 

Similarly using Lemma 2.4 in the supplement of Kruijer and Rousseau ( |2011 1 

error (tr [(T(/ )(T(/ ) _1 - T^J" 1 ) 2 )] ) = 0((\d - d \ + || A do , fc |U)V) 

for all e > 0. Let (fc, d) be such that 9 d £ S n j k , then 
(6.12) 

|A(&) - A{0 d )\ Fr = 0((\d - d„ | + HA^^IU)^) = 0(n^je n ) (2a - 1)/{2a) ), Ve > 
see Lemma 3.1 in Kruijer and Rousseau (2011 ). We also have that for all \d— d'\ < 



\z T [A(0' d ) - A(9 d )]z - tv[A(6' d ) - A(9 d )}\ < (z T z + njn" 1 = O p (l) 
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uniformly, so that a simple chaining argument combined with (6.121 and Lemma 



1.3 in the supplement of Kruijer and Rousseau (2011 1 implies that, for all e > 
(6.13) 

sup \z T [A(6 d ) - A(8 )]z - tx[A(e d ) - A(0 o )]\ > n^je^ 2 ^'^ < e' n "' 



Hence, combining (6.7), (6.9), (6.101 and (6.13) together with the fact that there 
exists J , Co > such that for all j > J 0] setting S n j — {8; (j — l)e„ < l(0 o , 8) < 



[ exp{l n {8)-l n (8 )}dn{8) < 

Js n , 3 



on a set having probability going to 1. Thus for all 6 > 0, there exists a set with 
P probability going to 1 such that 



j= J o fceN 

, w(8 ) w(8 d ) 

< } } sup — sup — TT^rPyk) , 

J~l t^i {d,i d , k )es nj , k w (°d) (d,e)es n , j>h w{6) J Sn 

J„ kn,l 
(ien) (2 <s - 1 )/( 2a ) 



j n (e)-i n (e ) 



d P {8) 



3 = Jo 
< e - 2cne " 



k=l 



>p(k) I 



in(e)-i n (e a ) 



dp{8) 



Finally note that if there exists £ = (d,k,0) € Uj < j S n j^ kl then 9 djk G ^j<j a S n j t k 
by definition of 9 dk . Replacing e„ by Joe n , with an abuse of notations, we write 
S n ,i,k '■= Uj<j a S n .j t k for all k and we split the set k into k < fc n ,i and k > fc n ,i- 



k=l 



k n ,l 

< w 2 ^2 _ sup 



^ (d,i d , k )es nA , k < {d ^s n . 1 , k w 2 {8 d ) 



and 



k n ,l 

v 2 (8) 

fc=i 



>^E 



inf 



mf „,- T p{k\X)1T n {bn,l,k\ k )i 



fr[(d^ d . k )es, l}1 , k w 2 (d,0)es n ,i, k w 2 (8 d ) 



we have using (6.13) associated with j = 1, with probability going to 1 uniformly 
in {d,$ d<k ) E L>k<k„ xS n ,i,k w 2 (8 d )/w 2 = 1 + o p (l), and using (lO) associated 



with j = 1, w 2 {8)/w 2 {8 d ) = 1 + o p (l) uniformly in (d, 6<) e U k<knl S nA . k . Let 
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k > k n ,i = k e n 1/a > (n/logn) >> ne 2 n , then 



E"" 



< 



w, 



l (0) E *s n , ltk 

fc=fc„ 1+1 



and on the set or x such that D n > e 



2 w E Vi, 



fc=fc n(1 +l 
oo 



(6.14) 



< E p(k) sup 



w (ey 



sup 



k=k n i+l 



(<J,9)eS» lM w^d)- 1 d:6 d es n , llk w i e o) 1 

x / cxp{l n (O)-l n (0 o )}dTr(d,9) 

JS n ,i, k 

<— E ^) e£fc / exp{i n (d)-i n (e )}p(e)de, 



fc=fc„ 1+1 



S„,i 



for all e > 0, on a set of probability going to 1 using (6.9 1 and (6.101. A Markov 
inequality implies that 



pn 



oo « oo 

E P(*K*/ exp{l n (0)-l n (O o )}dn(d,9)>u n E P( k ) e * k 

fe=fe„ >1 +l •>S n< i tk fc = fc„_i+l 



is 0{\/u n ) for all u n going to infinity, so that with probability going to 1, 



l ( e ) E ^ nA , k 



■ e 



— c\k n i+cne„ 



o(l) 



k=k n .i + l 

for some ci, c > 0. We finally obtain that 

E"» K(0)] =^ 2 (0 o )(l + Op (l))+ W - 1 (0 o ) e - c " e " Op (l), 

for some C > 0. Simple computations show that for all e > 0, 

Pn\logw(0 o )\>rf}=o(l), 

we can also make precise the upper bound on |logu;(0 o )| by controlling better 
|j4(0 o )|, leading to a term of order (logn) 3 , which terminates the proof. 
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